An Adaptive Sequential Monte Carlo Sampler 

Paul Fearnhead and Benjamin M. Taylor 
May 11, 2010 

Abstract 

(^ Sequential Monte Carlo (SMC) methods are not only a popular tool in the 

,__, analysis of state-space models, but offer an alternative to MCMC in situations 

o 

where Bayesian inference must proceed via simulation. This paper introduces 



o 



CO 



a new SMC method that uses adaptive MCMC kernels for particle dynamics. 
The proposed algorithm features an online stochastic optimization procedure 



> 

to select the best MCMC kernel and simultaneously learn optimal tuning 

ON 

,__! parameters. Theoretical results are presented that justify the approach and 

in 

O give guidance on how it should be implemented. Empirical results, based 

o 

on analysing data from mixture models, show that the new adaptive SMC 
> 

^c algorithm (ASMC) can both choose the best MCMC kernel, and learn an 

appropriate scaling for it. ASMC with a choice between kernels outperformed 



the adaptive MCMC algorithm of Haario et al. (1998) in 5 out of the 6 cases 
considered. 

Keywords: Adaptive MCMC, Adaptive Sequential Monte Carlo, Bayesian Mixture 
Analysis, Optimal Scaling, Stochastic Optimization. 

1 



1 Introduction 

Sequential Monte Carlo (SMC) is a class of algorithms that enable simulation from 
a target distribution of interest. These algorithms are based on defining a series of 
distributions, and generating samples from each distribution in turn. SMC was 
initially used in the analysis of state-space models. In this setting there is a 
time-evolving hidden state of interest, inference about which is based on a set of 



noisy observations (Gordon et al., 1993 Liu and Chen, 1998 Doucet et a!., 2001; 



Fearnhead, 2002). The sequence of distributions are defined to be the posterior 



distributions of the state at consecutive time-points given the observations up to 
those time points. More recent work has looked at developing SMC methods that 
can analyse state-space models which have unknown fixed parameters. Such 
methods introduce steps into the algorithm to allow the support of the sample of 
parameter values to change over time, for example by using ideas from kernel 



density estimation (Liu and West, 2001), or MCMC moves (Gilks and Berzuini, 



1999 Storvik, 2002; Fearnhead, 2002). 



Most recently, SMC methods have been applied as an alternative to MCMC for 



standard Bayesian inference problems. (Neal, 2001 Chopin, 2002; Del Moral et al 



2006 Fearnhead, 2008). In this paper the focus will be on methods for sampling 



from the posterior distribution of a set of parameters of interest. SMC methods for 
this class of targets introduce an artificial sequence of distributions that run from 
the prior to the posterior and sample recursively from these using a combination of 
Importance Sampling and MCMC moves. This approach to sampling has been 



demonstrated empirically to often be more effective than using a single MCMC 



chain (Jasra et al., 2007, 2008). There are heuristic reasons for why this may true 
in general: the annealing of the target and spread of samples over the support 
means that SMC is less likely to be become trapped in posterior modes. 
Simply invoking an untuned MCMC move within an SMC algorithm would likely 
lead to poor results because the move step would not be effective in combating 
sample depletion. The structure of SMC means that at the time of a move there is 
a sample from the target readily available, this can be used to compute posterior 



moments and inform the shape of the proposal kernel as in Jasra et al. (2008); 
however, further refinements can lead to even better performance. Such 
refinements include the scaling of estimated target moments by an optimal factor, 



sec 



Roberts and Rosenthal (2001[) for example. For general targets and proposals 



no theoretical results for the choice of scaling exist, and this has led to the recent 



popularity of adaptive MCMC (Haario et al., 1998 Andrieu and Robert, 2001 



Roberts and Rosenthal, 2009; Craiu et al., 2009; Andrieu and Thorns, 2008). In 



this paper the idea of adapting the MCMC kernel within an SMC algorithm will 

be explored. 

To date there has been little work at adapting SMC methods. Exceptions include 



the method of Jasra et al. (2008), whose method assumes a likelihood tempered 



sequence of target densities (see Neal (2001)) and the adaptation procedure both 



chooses this sequence online, as well as computing the variance of a random walk 
proposal kernel used for particle dynamics. Cornebise et al. (2008|) also considers 



adapting the proposal distribution within SMC for state-space models. Assuming 

that the proposal density belongs to a parametric family with parameter 6, their 

method proceeds by simulating a number of realisations for each of a range of 
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values of 9 and selecting the value that minimises the empirical Shannon entropy 
of the importance weights; new samples are then re-proposed using this 



approximately optimal value. Further related work includes that of Douc et al. 



(2005 ) and Cappe et al. (2008 ) on respectively population Monte Carlo and 
adaptive importance sampling. 

The aims of this paper are to introduce a new adaptive SMC algorithm (ASMC) 
that automatically tunes MCMC move kernels and chooses between different 
proposal densities and to provide theoretical justification of the method. The 
algorithm is based on having a distribution of kernels and their tuning parameters 
at each iteration. Each current sample value, called a particle, is moved using an 
MCMC kernel drawn from this distribution. By observing the expected square 



jumping distance (Craiu et al., 2009; Sherlock and Roberts, 2009) for each particle 



it is possible to learn which MCMC kernels are mixing better. The information 
thus obtained can then used to update the distribution of kernels. The key 
assumption of the new approach is that the optimal MCMC kernel for moving 
particles does not change much over the iterations of the SMC algorithm. As will 
be discussed, and shown empirically, in section [5] this can often be achieved by 
appropriate parameterisation of a family of MCMC kernels. 

The structure of the paper is as follows. In the next section, the model of interest 
will be introduced and followed by a review of MCMC and SMC approaches. Then 
in Section [3j the new adaptive SMC will be presented. Guidelines on implementing 
the algorithm as well as some theory on the convergence will be presented in 
Section |4} In Section [5] the method will be evaluated using simulated data. The 

results show that the proposed method can successfully choose both an 
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appropriate MCMC kernel and an appropriate scaling for the kernel. The paper 
ends with a discussion. 

2 Model 

The focus of this paper will be on Bayesian inference for parameters, 9, from a 
model where independent identically distributed data is available. Note that the 
ideas behind the proposed adaptive SMC algorithm can be applied more generally 
(see section |6J). Let ir(9) denote the prior for 9 and n(y\9) the probability density 
for the observations. The aim will be to calculate the posterior density, 

n 

n(9\ yiM ) <xir(9)Y[ir( yi \e), (1) 

1 = 1 

where, here and throughout, it will be used to denote a probability density, and 
y 1:t means y 1 ,...,y t . 

In general, 7r(9\yi :n ) is analytically intractable and so to compute posterior 
functionals of interest, for example expectations, Monte Carlo simulation methods 



are often employed. Sections |2.1| and |2.2| provide a brief description of two such 
Monte Carlo approaches. 

2.1 MCMC 

An MCMC transition kernel, Kh, is an iterative rule for generating samples from a 

target probability density, for example a posterior. K^ comprises a proposal kernel, 

here and throughout denoted qh (the subscript h indicates dependence on a tuning 

parameter) and an acceptance ratio that depends on the target and, in general, the 
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proposal densities (see Gilks et al. (1995); Gamerman and Lopes (2006) for reviews 



of MCMC methodology). The most generally applicable MCMC method is 
Metropolis-Hastings, see Algorithm [T| ( [Metropolis et al., 1953 Hastings, 1970). 



Algorithm 1 Metropolis-Hastings Algorithm (Metropolis et al., 1953 Hastings 



1970) 



1: Start with an initial sample, 9^°\ drawn from any density, ix$. 

2: for j = 1,2, ... do 

3: Propose a move to a new location, 9, by drawing a sample from g^(^^ _1 \ 6). 

4: Accept the move (ie set #W = #) with probability, 

\ '7T(0 ( *-%l:»)<Zh(0e~l),6O ( 

else set 6® = 6^1 
5: end for 



Probably the simplest MH algorithm is the random walk Metropolis (RWM). The 

proposal kernel for RWM is a symmetric density centred on the current state, the 

most common example being a multivariate normal, 

Qh{0 i $) — A/"(0; 0^ -1 ), /i 2 ^), where S^ is an estimate of the target covariance. 

Both the values of S^ and h are critical to the performance of the algorithm. If £„ 

does not accurately estimate the posterior covariance matrix, then the likely 

directions of the random walk moves will likely be inappropriate. On the other 

hand, a value of h that is too small will lead to high acceptance rates, but the 

samples will be highly correlated. If h is too large then the algorithm will rarely 

move, which in the worst case scenario could lead to a degenerate sample. 

These observations on the role of h point to the idea of an optimal scaling, a h 

somewhere between the extremes that promotes the best mixing of the algorithm. 

In the case of elliptically symmetric unimodal targets, an optimal random walk 

scaling can sometimes be computed numerically; this class of targets includes the 
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Multivariate Gaussian (Sherlock and Roberts, 2009). Other theoretical results 



include optimal acceptance rates which are derived in the limit as the dimension of 



9, d — > oo (see Roberts and Rosenthal (2001 ) for examples of targets and 



proposals). In general however, there are no such theoretical results. 

One way of circumventing the need for analytical optimal scalings is to try to learn 



them online (Andrieu and Robert, 2001; Atchade and Rosenthal, 2005), this can 



include learning both a good scaling, h, and estimating the target covariance, T, w 



(Haario et al., 1998). Recent research in adaptive MCMC has generated a number 



of new algorithms (see for example Andrieu and Thorns (2008); Roberts and 



Rosenthal (2009); Craiu et al. (2009)), though some care must be taken to ensure 



that the resulting chain has the correct ergodic distribution. 

2.2 Sequential Monte Carlo 

An alternative approach to generating samples from a posterior is to use sequential 
Monte Carlo (SMC, see Del Moral et al. (2006[ ) for a review). The main idea 
behind SMC is to introduce a sequence of densities leading from the prior to the 
target density of interest and to iteratively update an approximation to these 
densities. For the application considered here, it is natural to define these densities 
as n t (9) = n{0\yi : t) for t = 1, . . . , n; this 'data tempered' schedule will be used in 
the sequel. The approximations to each density are defined in terms of a set of 



weighted particles, {9^ , w\ }y£i, produced so that as M — > oo, Monte Carlo sums 



converge to their 'correct' expectations: 



't 






i in L 'm * = ^mWt% 



for all TTj-integrable functions, /. One step of an SMC algorithm can involve 
importance reweighting, resampling and moving the particles via an MCMC kernel 



(Gilks and Berzuini, 1999 Chopin, 2002). For concreteness, this paper will focus 



on the iterated batch importance sampling (IBIS) algorithm of|Chopin (2002). 



The simplest way to update the particle approximation in model Q is to let 
0\ = 9)r\ and w t = w t _ 1 7i(y t \6 i f ). However such an algorithm will degenerate 
for large t, as eventually only one particle will have non- negligible weight. Within 
IBIS, resample-move steps (sometimes referred to here as simply 'move steps') are 
introduced to alleviate this. In a move step, the particles are first resampled so 
that the expected number of copies of particle 6\ is proportional to w t . This 
process produces multiple copies of some particles. In order to create particle 
diversity, each resampled particle is moved by an MCMC kernel. The MCMC 
kernel is chosen to have stationary distribution ir t . The resulting particles are then 
assigned a weight of 1/M. 
The decision of whether to apply a resample-move step within IBIS is based on the 



effective sample size (ESS, see Kong et al. (1994); |Liu and Chen (1998)). The ESS 



is a measure of variability of the particle weights; using this to decide whether to 



resample is justified by arguments within Liu and Chen (1995) and Liu et al. 



(1998). Full details of IBIS are given in Algorithm [2 



Chopin's IBIS algorithm is a special case of the resample-move (RM) algorithm of 
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Algorithm 2 Chopin's IBIS algorithm 



1: Initialise from the prior {9q , Wq }fLi ~ 7To- 



10 
11 
12 



for t — 1, . . . ,n do 

Assume current {9t-ii w t~i}j^=i ~ ^t-i 

Reweight w t (i) = w£W0t-\)M-i(0t-i)- Result: {^-D™?')}^ ~ 7T t . 

if particle weights not degenerate (see text) then 

else 

Resample: let /C = {ki, . . . , fe^f} ^ {!)•••> -^} De the resampling indices, 
then {#£_!, 1/M}fe e x: ~ ^t- Relabel: fcj <— j, the jth resampling index so 
that {eg?!, l/M}f =1 „ « t . 

Move via ^-invariant MCMC kernel. Result: {6 { t j \ l/M}f =1 ~ n t . 
end if 
end for 



Gilks and Berzuini (1999) and the general algorithm described by Del Moral et al. 



(2006) (note that the latter method applies beyond MCMC-within-SMC and 



provides a unifying framework for sampling from sequences of targets). The main 



difference between RM and IBIS is that, in their presentation of RM, Gilks and 



Berzuini (1999) use resampling and move steps at each iteration of the sampler. 
Chopin noticed that at a particular iteration it may be better to just reweight the 
particles, rather than incur the computational cost and degeneracy induced by a 
resample-move step. Another related algorithm, a development of simulated 



annealing (Kirkpatrick et al., 1983) due to Neal (2001), utilises an alternative 

'likelihood tempered' sequence of targets. The proposed target sequence is 

7r t (6) = 7r(9)7r(yi- n \9Y t , where {£ t } is a sequence of real numbers starting at (the 

prior) and ending on 1 (the posterior). Since each move step requires evaluation of 

the likelihood over all available observations, the main disadvantage of likelihood 

tempering is computational cost, although for models with sufficient statistics this 

is not an issue. Further disadvantages of Neal's proposed algorithm are the 
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absence of resampling steps which eventually leads to sample degeneracy; and the 
lack of interpretability of intermediate target densities. 
The efficiency of an SMC algorithm, such as IBIS, depends on the mixing 
properties of the associated MCMC kernel. Within SMC there is the advantage of 
being able to use the current set of particles to help tune an MCMC kernel. For 
example, the weighted particles can give an estimate of the posterior covariance 
matrix, which can be used within a random walk proposal. However even in this 



case, the proposal variance still needs to be appropriately scaled (Roberts and 



Rosenthal, 2001 Sherlock and Roberts, 2009). In the next section the new 



adaptive SMC procedure will be introduced, the algorithm can learn an 
appropriate tuning for the MCMC kernel, and can also be used to choose between 
a set of possible kernels. 

3 The Adaptive SMC Sampler 

First consider the case where the move step in the IBIS algorithm involves one 
type of MCMC kernel. Let n t be an arbitrary continuous probability density (the 
target) and Kh,t a vr t -invariant MCMC kernel with tuning parameter, h. The 
parameter h is to be chosen to maximise the following utility function, 



g {t \h) = J irt(0t-i)K h ,t(e t -iA) HOt-iA)d8t-id0 t , (3) 

= E[A(0 t _!,0 t )], 

where A(6 t -i, t ) > is a measure of mixing of the chain. Most MCMC adaptation 



criteria can be viewed in this way (Andrie^i and Thorns, 2008). Note that for 



simplicity of presentation, A only depends on the current and subsequent state, 
though the idea readily extends to more complex cost functionals, for example 
involving multiple transitions of the MCMC chain. The function g^' is the average 
performance of the chain with respect to A, which would normally be some 
measure of mixing. The ideal choice for A would be the integrated autocorrelation 
time (whence the goal would be to maximise —g^), but a computationally simpler 
measure of mixing is the expected square jumping distance (ESJD). Maximising 
the ESJD is equivalent to minimising the lag-1 autocorrelation; this measure is 



often used within adaptive MCMC, see for example Sherlock and Roberts (2009); 



Pasarica and Gelman (2010) 



In the following it will be assumed that the proposal distribution can depend on 
quantities calculated from the current set of particles (for example estimates of the 
posterior variance), but this will be suppressed in the notation. The main idea of 
ASMC is to use the observed instances of A(6>j_i, 9 t ) to help choose the best h. 
The tuning parameter will be treated as an auxiliary random variable. At 
time-step t the aim is to derive a density for the tunings, tp^(/i). If a move step is 
invoked at this time, a sample of M realisations from ir^\h), denoted {h t jj^, 
will be drawn and 'allocated' to particles at random. 

When moving the jth resampled particle, the tuning parameter h\ J will be used 
within the proposal distribution. For notational simplicity this value will be 
denoted h in the following. Let 6^\ be the jth resampled particle (see step 9 of 
Algorithm 2). In moving this particle, 9 t is drawn from qh(0^_ 1 , ■ ), and accepted 
with probability a/ l (6' t _ 1 , Q\ ), given by (2). If the proposed particle is accepted 



then 6 { t j) = d\ j) otherwise d\ j) = 0^. 
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The utility function in ^ simplifies to, 



g (t) (h)= / 7i t (e t ^)q h (e t ^,e t )A(e t ^,e t )de t ^de tl 



where 



A(^U 0) ) = a^lj^k^lj?) 



Since by assumption the resampled particles are approximately drawn from ir t and 
proposed particles are drawn from qh(9t~i,9t), the quantity A(6' t _ 1 ,6' t ) can be 
viewed as an unbiased estimate of g^\h). 

The approach in this paper is to use the observed A(6 l t _ 1 , 9\ f ) to update the 
distribution 7r^\h) to a new distribution 7r^ +1 ^ (h). In particular each h t J will be 
assigned a weight, /(A(^_ 1 ,^ t )), for some function / : IR + — > IR + . The new 
density of scalings will be defined, 



M 



^ t+1 \h) oc J2 fiWA, W ] ))R(h - h?), (4) 



j=i 



where R(h — h t ) is a density for h which is centred on h^ . Simulating from 
n^ t+1 \h) is achieved by first resampling the h t J s with probabilities proportional to 
their weight and then adding noise to each resampled value; the distribution of this 
noise is given by R( ■ ). The motivation for adding noise to the resampled /i-values 
is to avoid the distributions 7P*'(/i) degenerating too quickly to a point-mass on a 
single value. Similar ideas are used in dynamic SMC methods for dealing with 



fixed parameters, for example West (1993); Liu and West (2001). In practice the 
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variance of the noise can depend on the variance of n^'(h) and by analogy to 
Kernel density estimation should tend to as the number of particles gets large. 
If there is no resampling at step t then set ir^ t+1 \h) = Tx^'{h). The scheme is 
initiated with an arbitrary distribution n(h). The specific choice of / considered in 
this paper is a simple linear weighting scheme, 

/(A) = o + A, a > 0. 

Theoretical justification for this choice is given in the next section. 

One assumption of the proposed approach is that a good choice of h at one 

time-step will be a good choice at nearby time-steps. Note that this is based on an 



implicit assumption within SMC that successive targets are similar (see Chopin 



(2002); Del Moral et al. (2006) for example). Furthermore, using estimates of 
posterior variances within the proposal distribution can also help ensure that good 
values of h at one time-step will be a good choice at nearby time-steps. Some 
theoretical results concerning this matter will be presented in Section |4} 
To choose between different types of MCMC kernel is now a relatively 
straightforward extension of the above. Assume there are / different MCMC 
kernels, each defined by a proposal distribution qv, where i E {1, . . . , I}. The 
algorithm now learns a set of distributions, 7T^'(h,i), for the pair of kernel type 
and associated tuning parameter. Each particle is assigned a random kernel type 
and tuning drawn form this distribution, with the pair, (hl_i,i t -i), associated with 
t x . The algorithm proceeds by weighting this pair based on the observed 
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X{6t\,9t) values as before, and updating the distribution, 

M 

^\h, *) oc ]T f&V&, ~^ ] ))R{h - h^S^ii). (5) 

where 6.q) (i) is a point mass on i = if\. The method is described in detail below, 
% t-x 

see Algorithm [3j Within the specific implementation described, the sample of 

pairs, (h,i), from 7r^(h,i) are allocated to particles randomly immediately after 

the resample-move step at iteration t. These pairs are then kept until the next 

iteration a resample-move step is called. 

Algorithm 3 The Adaptive SMC algorithm. Here, 7i"o( •),..., 7T n ( • ) are an arbitrary 
sequence of targets; an MCMC kernel is assumed for particle dynamics. 

1: Initialise from the prior {9q , Wq jfLi ~ ^o- 

2: Draw a selection of pairs of MCMC kernels with associated tuning parameters, 

{{h { ^\K^l)}f =1 = {(h { j \i { j) )}f =1 ~ n(h,i), and attach one to each particle 

arbitrarily. 
3: for t = 1, ... ,n do 
4: Assume current {Ot-ii w t-i}jLi ~ n t-i 



5 

6 

7 

8 
9 

10 
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Reweight w { t j) = ^^(CiV^-i^S)- Result: {O^wf}^ ~ 7r t . 
if particle weights not degenerate (see text) then 

*->>t+i. 

else 

Resample: let /C = {ki, . . . , fc^-} ^ {!;•••> ^} be the resampling indices, 
then {9i_\, l/M} keJ c ~ rc t . Relabel: kj <— j, the jth resampling index so 

that {0{_i, 1/M}^£ 1 ~ n t . DO NOT resample kernels or tuning parameters 
at this stage. 
12: Move 6l_ l via the ^-invariant MCMC kernel, K^j., and tuning parameter 

h t i, denote the proposed new particle as 9\ and accepted/rejected particle 

as 6\ 3) . Result: {6 { t 3) , l/M}f =l ~ vr t . 

13: To obtain {(h[ j \ K { h j) t )}^ =1 = {(h { t j) ,i { t j) )}, sample M times from Q. Allo- 

cate the new selection to particles at random. 
14: end if 
15: end for 
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4 Theoretical Results 

In this section the proposed algorithm will be justified by a series of theoretical 
results; guidance as to how it should best be implemented will also be given. The 
results presented here apply in the limit as the number of particles, M — > oo. As 
discussed above, in this limit, the variance of the kernel R( ■ ) in Q tends to 0. 
To simplify the discussion, it will be assumed that tunings are one dimensional 
(the arguments presented extend readily to the multivariate case). For a slight 
notational simplification, the criterion A will be used, rather than A (as suggested 
in algorithm [3]) ; this does not affect the validity of any of the arguments, which 
also hold for A. The section is split into two parts. 



Firstly, in section |4.1[ it is of interest to examine what happens to the distribution 
of the hs after one step of reweighting and resampling; this result will lead to a 
criterion for the choice of weight function that guarantees MCMC mixing 



improvement with respect to A. In section 4.2 the sequential improvement of hs 
will be considered over many steps of the ASMC algorithm and with a changing 
target. General conditions for convergence of ASMC to the optimal kernel and 
tuning parameter will be provided. 

4.1 One Step Improvement and Weighting Function 

In this section and in the relevant proofs, it is appropriate to temporarily drop the 
t superscript, eg g® = g, Q t -\ = and 9 t = 9'. To study the effect of reweighting 
and resampling on the distribution of the hs, suppose that currently 
{hV'}jLi ~ ff{ti), the pdf of a random variable, H. The dependence on current and 
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proposed particles means the weight attached to hP' is random, but also, due to 
the independence of h with the particles, is an unbiased estimator of the 'true' 
weight, K& t Qi\H[f(A)\H = h^'], where Ee,e'|H denotes the expectation with respect 
to the joint density of the random variables and G' conditional on H. The true 
weighting function will be denoted, 

w(h)=E e>& \ H [f{A)\H = h). (6) 

The following proposition, which is used repeatedly in subsequent results, shows 
how reweighting and resampling affects ir(h). 

Proposition 4.1 Suppose that currently {/i^- ) }*£ 1 ~ vr(/i), the pdf of a random 
variable, H, independent of 8. Let w(h) be the weighting function defined as in 
(6|). Then in the limit as M — > oo, the distribution of the reweighted and 
subsequently resampled hs is, 



**(*) W{h)Ak) 



f w{h)ir{h)dh' 

Proof: See Appendix \A\ □ 

Since ASMC uses a selection of hs, it is appropriate as a starting point to look for 
conditions under which their distribution is improved. It would be desirable if, 
over 7T*(/i), the objective function would on average take a higher value, for then 
the new distribution would on average perform better with respect to A than the 
old. This criterion can be stated in mathematical form: conditions on / are sought 
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for which, 

f n*{h)g(h)dh> I n{h)g(h)dh. 



Lemma 4.1 Assuming g is Tr(h)-integrable, in the limit as M — > oo, 

E n , w [g(h)] > E w(h) [g(h)} <^> cov n{h) [g(h), w(h)} > 0. (7) 

That is, provided there is positive correlation between the objective function g(h) 
and the weighting function, w{h), the new distribution of hs will on average 
perform better (on g(h)) with respect to A than the old. 

Proof: The result is obtained by expanding definitions in O: 

E«* {h) [g{h)} > E n{h) [g(h)}, 
■^^ E n{h) [w(h)g(h)} > E^ h) [w(h)]E^ h) [g(h)], 
■^^ cov w ( ft )[flf(/i),ty(/i)] > 0. 



□ 



Although this result does not directly yield a general form for /, it does give a 
simple criterion that must be fulfilled by any candidate function. An immediate 
corollary gives more concrete guidance: 

Corollary 4.1 A simple linear weighting scheme, /(A) = a + A, where a > 0, 
satisfies |^). 



Proof: This is trivially verified using the linearity property of the covariance. □ 
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A consequence of this lemma is that the ASMC algorithm with linear weights will 
lead to sequential improvement with respect to A under very weak assumptions on 
the target and initial density for h. A linear weighting scheme may at first glance 
seem sub-optimal, and that it should be possible to learn h more quickly using a 
function /(A) that increases at a super-linear rate. The present authors conjecture 
that such functions will not always guarantee an improvement in the distribution 
of h. For example consider /(A) = A 2 , where the weighting function takes the 
form, w(h) = g(h) 2 + V[A|if = h]. Because of the V[A|if = h] term, which may be 
large for values of h where g(h) is small, it is no longer true that 
cov n (h)[g(h),w(h)] > in general. 

4.2 Convergence Over a Number of Iterations 

The goal of this section is to provide a theoretical result concerning the ability of 
ASMC to update the distribution of hs with respect to a sequence of targets, 
7Ti(#i), . . . ,TT n (9 n ). To simplify notation, it will be assumed that a move occurs at 
each iteration of the algorithm. The result can be extended to the case where 
moves occur intermittently, providing they incur infinitely often in the limit as the 
number of data points goes to infinity. 
Define a set of functions, {g^(h)}^ =1 , 

g (t) (h) = J KtiOt^KhtiOt^AWt-iAWt-iMt > o, 

where, for each t, Kh,t is a 7r t -invariant MCMC kernel. 
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For a linear weighting scheme, 



7T« 



(h)<x«(h)i[(° + 9 { 'Kh))- 



s=l 



Below it will be shown that as t — >■ oo if the sequence of functions, {g^'(h)}, 
converge to a fixed function, g(h), and if g has a unique global maximum, /i opt , 
then n^\h) will converge to a point mass on /i opt . 

The key assumption of this theorem regards the convergence of the functions 
{g^\h)}. This assumption is linked to the idea that a good value of h for the 
target at time t is required to be a good value at times later on. As mentioned 
above, the motivation behind SMC is that successive targets should be similar. 
Moreover, standard Bayesian asymptotic theory shows that as the number of 
observations, n, tends to infinity, the posterior tends in distribution to that of a 
Gaussian random variable. Thus, providing information from the current 
parameters about the posterior variance is used appropriately, it should be 
expected that the sequence of functions, {g^(h)}, would also converge. This issue 
will be explored empirically in the next section. 

Theorem 4.1 Let ir(h) be the initial density for the tuning parameter with 
support HCl and a > 0. Dehne, as above, 



irW 



(h)cxir(h)]l(a + gM(h)). 
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Suppose there exists a function g : Ti — »■ IR>o such that 



sup |# w - g\ < k g r a } a e (0, 1), A; s > 0. 
hen 



Furthermore, suppose g has a unique global maximum, h opt , contained in the 
interior ofH and that g is twice differentiable in an interval containing h opt . Then 
as t — y oo, Tc^\h) tends to a Dirac mass centred on the optimal scaling, h opt . 

Proof: See Appendix [Bj □ 



5 Results 



This section is organised as follows. In Section 5.1, the convergence of h to an 



optimal scaling will be demonstrated empirically using a linear Gaussian model. 



Then in Section 5.2 the problem of Bayesian mixture analysis will be introduced. 



In Sections 5.3 and 5.4 the proposed method will be evaluated in simulation 
studies using the example of Bayesian mixture posteriors as defining the sequence 
of targets of interest. 



Following Sherlock and Roberts (2009), the expected (Mahalanobis) square 



jumping distance will be considered as an MCMC performance criterion: 

A(0 t _i,0 t ) = (6U - e t ) T ±-l{9 t -i - e t ), 

where 6 t -i and t are two points in the parameter space and T, nt is an empirical 
estimate of the target covariance obtained from the current set of particles. 
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Two different MCMC kernels will be considered within the SMC algorithm, these 
are defined by the following two proposals: 

qUOt-iA) = Af(ae t -i + (l-a)9t,h 2 t n ), fee (0,1], a = Vl-h 2 , 

where 9 t and E ffi are respectively estimates of the target and covariance. The first 
of these is a random-walk proposal. The second is based upon a method for 



updating parameter values in Liu and West (2001), here named the 'Liu/West 



proposal. The Liu/West proposal has mean shrunk towards the mean of the target 
and the imposed choice of a = \/I — h 2 sets the mean and variance of proposed 
particles to be the same as that of the current particles. Note that if the target is 
Gaussian, then this proposal can be shown to be equivalent to a Langevin proposal 



(Roberts and Tweedie, 1996). 



5.1 Convergence of h 

It is of interest to see an example g(h) and demonstrate convergence of one of the 
proposed algorithms to the optimal scaling. This will be achieved using a Gaussian 
target, for which a useable analytic expression for the optimal scaling for the 
random walk kernel is available. The results in this section are based on 100 
observations simulated from a 5-dimensional standard Gaussian density, 
2/i:ioo ~ <M"(0, I5), where I5 is the 5x5 identity matrix. The observation variance 
was assumed to be known and therefore the probability model, or likelihood, was 
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specified as, 

n(y\e)=M(y;0,I 5 ). 

The prior on the unknown parameter, 9, the vector of means, was set to A/"(0, 5I5). 
ASMC with a random walk proposal was used to generate M = 2000 particles 
from the posterior. Resampling was invoked when the ESS dropped below M/2 
and no noise was added to the hs after resampling. The initial distribution for h 
was chosen to be uniform on (0, 10). Note that this model admits exact inference 
via the Kalman Filter. 

The left hand plot in Figure [I] shows g(ti) for this target. Note in this case that the 
sequence of functions, {g®}, does not change much since each intermediate target 
is exactly Gaussian and the proposal is scaled by the approximate variance of the 
target. The optimum scaling, h opt , was computed using 1-dimensional numerical 



integration and Theorem 1 of Sherlock and Roberts (2009). The right hand plot 



illustrates several features of the adaptive RWM; the resampling frequency, that 
the algorithm does indeed converge to the true optimal scaling and the 
approximate rate of this convergence. 

5.2 Bayesian Mixture Analysis 

The ability of the ASMC algorithm to learn MCMC tuning parameters in more 
complicated scenarios was evaluated using simulated data from mixture likelihoods 



(for a complete review of this topic, see Fruhwirth-Schnatter (2006)). Let 



Pi, . . . ,p r > be such that Yll=iPi = 1- Let J\f( ■ ; fx, v) denote the normal density 

function with mean /1 and variance v. Let 9 = {pi :r -x,Vi. r , jWi :r }. 
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mean with 2.5-97.5% range 



r \\\\\ mnm 



50 100 



Observation Number 



Figure 1: Left plot: g(h) for a 5-dimensional Gaussian Target, explored with RWM 
and with ESJD as the optimization criterion. Right hand plot: convergence of h 
for the same density based on 100 simulated observations; the horizontal line is the 
approximately optimal scaling, 1.06. 

The likelihood function for a single observation, ?/j,is 



7T 






,Vj). 



The prior 9 was multivariate normal, on a transformed space using the generalised 
logit scale for the weights, log scale for variances, and leaving the means 
untransformed. The components of 9 were assumed independent a priori] the 
priors were log(pj/p r ) ~ jV(0, l 2 ), log(^) ~ W(-1.5, 1.3 2 ) and fjtj ~ A/"(0, 0.75 2 ), 
where j = 1, . . . , r — 1 in the case of the weights and j = 1, . . . , r for the means 
and variances. The MCMC moves within the SMC algorithm were performed in 
the transformed space, using the appropriate inverse transformed values to 
compute the likelihood in (JsJ) . 
An issue with mixture models is that for the above choice of prior, the likelihood 



and posterior are invariant to permutation of the component labels (Stephens, 
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2000). As a result the posterior distribution has a multiple of r! modes, 
corresponding to each possible permutation. One way of overcoming this problem 
is by introducing a constraint on the parameters, such as labelling the components 
so that fii < fi 2 <■■■< fi r , or so that v i < V2 < ■ • • < v r . This choice will affect 
the empirical moments of the resulting posterior and hence the proposal 
distribution of the MCMC kernel - both the random walk and Liu/West proposals 
depend on the posterior covariance, the latter also depending on the mean. In 
particular if there is a choice of ordering whereby the posterior is closer to 
Gaussian, then this is likely to lead to better mixing of the MCMC kernels. This 
phenomenon motivates the idea that it is also possible to choose between orderings 
on the parameter vector, which will be investigated in the sequel. 

5.3 Details of Implementation of ASMC 

In analysing the simulated data, a number of SMC and ASMC algorithms were 
compared. These correspond to using the following MCMC kernels: 

RWfixed Random walk ordered by means, with h chosen based on the theoretical 



results for Gaussian targets (Roberts and Rosenthal, 2001 Sherlock and 



Roberts, 2009[ ). 

RWadaptive Adaptive random walk ordered by means with uniform prior on h 

LWmean Adaptive Liu/West proposal ordered by means. 

LWvariance Adaptive Liu/West proposal ordered by variances. 

Kmix Adaptive choice between random walk ordered by means, Liu/West 
ordered on means and Liu/West ordered on variances. 



In each case the reference to ordering relates to how the component labels were 
defined, and thus affect the estimate of the posterior mean and covariance used. 
The above methods were also compared with the adaptive MCMC algorithm of 



Haario et al. (1998), denoted AMCMC. The specific implementation used here is 
as follows. The prior densities were identical to those for ASMC, the parameter 
vector was ordered by means and the random walk tuning was computed using the 



approximately optimal Gaussian scaling of h = 2.4/i/3r — 1. The AMCMC 
algorithm was run for 12000 iterations for the 5 dimensional datasets and for 
30000 iterations for the 8 dimensional datasets: these values were chosen so as to 
approximately match the number of likelihood computations involved between the 
ASMC and AMCMC methods. The burn-in period was set to half of the number 
of iterations and the method was initialised by a draw from the prior. There was 
an initial non-adaptive phase, lasting 1000 iterations, where the proposal kernel 
was scaled by the prior covariance and after which scaling was via estimates of the 
posterior covariance computed from the chain to-date, this was updated every 100 
iterations. 

For the ASMC algorithms, the initial distribution of hs was chosen to be uniform 
on (0, 2) for the random walk and on (0, 1) for the Liu/West proposal. In the case 
of the random walk, this range of hs can be justified by considering the optimal 
scaling for a random walk Metropolis on a multivariate Gaussian target in 5 
dimensions namely 2.38/v / 5 = 1.06 (and decays with increasing dimension as 
0(gT 1/2 )). For the Liu/West, h must be in (0,1]. 
In each case a Gaussian kernel with variance 0.015 2 was used in Q. A sensitivity 

analysis showed the effect of changing the variance of the noise slightly did not 
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affect the conclusions of this research. The parameter for the linear /i-weighting 
scheme was a = 0. If any h was perturbed below zero, a small value, 1 x 1CT 6 , was 
imputed and similarly for the Liu/West approach, any h perturbed above 1 was 
replaced by 1. 

The number of particles was set to M = 2000 for the 2-mixture datasets and 
M = 5000 for the 3-mixture datasets. Each algorithm was run 100 times on each 
dataset with the order of observations randomised each time. For the MCMC 



based methods an ESS tolerance of M/2 was used, as in Jasra et al. (2007). 



Resampling of the particles was via residual sampling (Whitley, 1994 Liu et al., 



1998), but multinomial sampling was used in selecting hs. For ease of computing 
posterior quantities of interest, each of the above algorithms was forced to 
resample and move on the last iteration. 

To compare the performance of different methods, a measure of the accuracy of the 
estimated predictive density was used. This is advantageous because it is invariant 
to re-labelling of the mixture components. The chosen accuracy measure was the 
variability of the predictive density (VPD) and was calculated as follows. Each run 
of the algorithm produces a weighted particle set, from which an estimate of 
E[7r(yW|yi : „)] can be obtained at 100 points, {y^}}^{, equi-spaced between -2.5 
and 2.5. For each i, the 100 simulation runs produce 100 realisations of 
E[7r(2/W|2/ 1:n )]; let y^' be the estimate of yW obtained from run j. The VPD 
measure used in this paper is 



mean j [var^ (y^ M - 1 ) ] 
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where mearij is the mean over the is and var, is the variance of the estimates of y^ 
obtained from the 100 simulations. The VPD gives an indication of the global 
variability of the predictive density across the simulations. In the tables, the 
relative VPD is used, which gives a scale-free comparison between methods. The 
SMC/ASMC algorithm with a relative VPD of 1 is the reference algorithm and has 
the smallest VPD of the SMC/ASMC methods; larger values indicate higher 
VPDs. For the AMCMC methods, the predictive densities were computed using all 
available samples ie with 6000 for the 2-mixture datasets and 15000 for the 
3-mixture datasets. For the SMC/ASMC methods a Rao-Blackwellised version of 
the predictive density was computed using all current and proposed particles 
available from the last iteration (that is, using 4000/10000 sample points 
respectively for the 2/3-mixture datasets). 

5.4 Results 

100 realisations from were simulated from the following likelihoods: 



Dataset 1: 7i(y 

Dataset 2: 7c(y 

Dataset 3: 7i(y 

Dataset 4: n(y 

Dataset 5: n(y 

Dataset 6: 7i(y 



0) = 0.5A%; -0.25, 0.5 2 ) + 0.5A%; 0.25, 0.5 2 ), 

0) = 0.5A%; 0, l 2 ) + 0.5A%; 0, 0.1 2 ), 

0) = 0.3A%; -1, 0.5 2 ) + 0.7A%; 1, 0.5 2 ), 

0) = o.5A%; -0.75, 0.1 2 ) + 0.5A%; 0.75, 0.1 2 ), 

0) = 0.35A% -0.1, 0.1 2 ) + 0.3A%; 0, 0.5 2 ) + 0.35A%; 0.1, l 2 ), 

0) = 0.25A% -0.5, 0.1 2 ) + 0.5A%; 0, 0.2 2 ) + 0.25A%; 0.5, 0.1 2 ), 
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This choice of datasets in combination with the selection of MCMC kernels allows 
several hypotheses to be tested empirically. Firstly, by comparing the performance 
of RWfixed with RWadaptive in these cases, it is possible to see whether anything 
is lost or gained by adapting the proposal kernel. Secondly, the impact of the 
different kernel orderings on MCMC mixing will become apparent by considering 
the performance of LWmean and LW variance in these settings. Datasets 1, 3, 4 
and 6 have well 'separated' means and similar variances, so one might expect 
algorithms ordering by means to perform better; whereas datasets 2 and 5 have 
well separated variances and similar means, so perhaps the algorithms ordering by 
variances might do well here. Thirdly, the Kmix algorithm should be able to 
choose the best ordering and it is of interest to compare the results from this 
algorithm with an adaptive version of the individual kernels. 
The simulation results from these datasets are presented in Table [1} These give 
both the relative VPD for each method, but also an estimated mean ESJD for 
each method. 

As would be hoped, a very strong correlation between lower VPD and higher 
ESJD is evident for the SMC/ASMC algorithms, this empirically supports the use 
of ESJD as the chosen criteria for adapting the MCMC kernels. 
There is relatively little difference across scenarios between the fixed and adaptive 
random walk methods. Furthermore, the adaptive random walk settles on a similar 
scaling as the fixed scaled version in datasets 3 and 4, whereas in datasets 1, 2, 5 
and 6, RWadaptive settles to values below RWfixed. In datasets 1, 2, 4 and 5, the 
adaptive RW outperformed the fixed equivalent (though the difference was 

negligible in datasets 4 and 5); this is likely due to the fact that the covariance was 
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Method 



Rel. 
VPD 



JD 



Dataset 1 

Ace. 



Propn. 



LWvariance 


f 


1.869 


0.3 


0.941 




LWmean 


1.189 


1.818 


0.32 


0.956 




Kmix 


1.258 


1.845 


0.317 


LWm 0.963 
LWv 0.958 


LWm 0.785 
LWv 0.215 


RWadaptivc 


2.391 


0.708 


0.21 


0.946 




AMCMC 


2.396 


0.575 


0.13 


1.073 




RWfixed 


3.414 


0.641 


0.18 


1.064 










Dataset 2 






LWvariance 


1 


9.139 


0.873 


0.978 




Kmix 


2.843 


9.023 


0.854 


LWm 0.984 
LWv 0.978 


LWm 0.005 
LWv 0.995 


AMCMC 


28.333 


0.197 


0.019 


1.073 




LWmean 


112.23 


1.869 


0.129 


0.969 




RWadaptive 


188.094 


0.77 


0.134 


0.584 




RWfixed 


219.907 


0.596 


0.041 


1.064 










Dataset 3 






LWmean 


1 


6.38 


0.792 


0.98 




Kmix 


1.54 


6.378 


0.806 


LWm 0.979 


LWm 1 


AMCMC 


7.465 


0.847 


0.146 


1.073 




RWfixed 


40.538 


1.124 


0.277 


1.064 




RWadaptivc 


45.739 


1.057 


0.369 


1.045 




LWvariance 


148.827 


0.737 


0.064 


0.966 










Dataset 4 






LWmean 


1 


7.132 


0.875 


0.98 




Kmix 


1.099 


7.127 


0.877 


LWm 0.979 


LWm 1 


AMCMC 


24.024 


0.462 


0.057 


1.073 




RWadaptive 


48.606 


1.143 


0.274 


1.086 




RWfixed 


51.919 


1.167 


0.298 


1.064 




LWvariance 


1096.167 


0.632 


0.027 


0.961 










Dataset 5 






AMCMC 


0.883 


0.356 


0.04 


0.849 




Kmix 


1 


2.258 


0.234 


LWm 0.964 
LWv 0.971 


LWm 0.044 
LWv 0.956 


LWvariance 


1.151 


2.284 


0.183 


0.971 




LWmean 


2.792 


1.007 


0.092 


0.961 




RWadaptive 


4.923 


0.847 


0.205 


0.435 




RWfixed 


5.187 


0.56 


0.055 


0.84 










Dataset 6 






LWmean 


1 


4.099 


0.277 


0.972 




Kmix 


1.018 


3.994 


0.363 


LWm 0.973 


LWm 1 


AMCMC 


1.556 


0.211 


0.04 


0.849 




RWfixed 


3.244 


0.996 


0.429 


0.84 




RWadaptivc 


3.259 


0.93 


0.192 


0.693 




LWvariance 


3.951 


1.951 


0.13 


0.944 





Table 1: Rel. VPD is relative VPD, JD is the mean square jumping distance, Ace 
is the mean final acceptance probability, h is the mean final scaling by kernel and 
Propn is the mean final kernel proportions. The kernels 'LWm' and 'LWv' indicate 
respectively a Liu/West proposal ordering on means or variances. 

29 



not a good estimate and the adaptive version of the algorithm was able to rescale 

to compensate for this. In datasets 3 and 6, the fixed random walk marginally 

outperformed the adaptive. 

The 'correctly ordered' sequential Liu/West algorithms considerably outperform 

the sequential RW-based methods in all six datasets and the incorrectly ordered 

versions perform worse or as poorly as the RW. For the Liu/West proposals, the h 

selected in each dataset was very close to 1: this special value corresponds to an 

independence kernel in the form of a moment-matched Gaussian approximation of 

the target. This is of interest as, in combination with the high acceptance rates of 

between 80-87% in datasets 2-4, suggests that the 'correct' ordering makes the 

target, ostensibly a very complex density function, approximately Gaussian in 

these cases. 

The Kmix algorithm is able to choose between orderings; the advantages of this 

are clearly evidenced in the results, as it selects the best ordering in each case, 

with the exception of dataset 1 (where the means and variances are both similar). 

The Kmix sampler settles almost unanimously on one ordering above the others. 

These results show empirically that there is not much difference in using a single 

(correctly chosen) kernel compared with using a selection of kernels. 

The performance of AMCMC was surpassed in all cases by the Kmix algorithm 

with the exception of dataset 5, where AMCMC was the best performing 

algorithm. In this latter case and in dataset 6, neither AMCMC nor the 

SMC/ASMC algorithms performed well. AMCMC outperformed RWadaptive in 

each case apart from in dataset 1, where the difference was small. However, the 

results show the average jumping distance of the kernel used in the ASMC 
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algorithm was greater than that of AMCMC in all cases, suggesting ASMC is able 
to adapt better to well-mixing kernels. To make this comparison more clear, two 
MCMC algorithms were run on each data-set, one using the final kernel found by 
AMCMC and one using a kernel based on the ASMC run, with the final estimated 
covariance matrix and the final mean value of the tuning parameter. The resulting 
MCMC algorithms performed very similarly in 3 cases (VPD of the two MCMC 
algorithms within 10% of each other) and the kernel found by ASMC performed 
better in the other 3 (VPD reduced by 30%, 40% and 80%). 

6 Discussion 

This paper introduces a new method for automatically tuning and choosing 
between different MCMC kernels. Where MCMC based SMC code already exists, 
adapting the hs would be a relatively straightforward means of enhancing 
performance, the main effort being in calculating the ratio of the proposed 
particles in the accept/reject step. Probably the most important conclusion from 
the simulation studies presented is that there is not much lost in terms of 
performance in the adaption process - the Kmix algorithm performed comparably 
to the respective best performing individual component and the adaptive random 
walk Metropolis performed similarly to the fixed, approximately optimally scaled 
version. 

Although the method as presented has assumed that i.i.d. observations are 
available from the likelihood, ASMC readily extends to the case of a dependent 
sequence. Furthermore, the extension to general sequences of target densities is 
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immediate, and implied by the choice of notation in Algorithm [3j The theoretical 
results presented in section [4] only apply to a one-dimensional h, in the case that 
the tuning parameter is a vector, the proposed algorithm and theoretical results 
still apply (with slight modifications), but convergence is likely to be at a slower 
rate. 

The main assumption of ASMC is that a good h at time t is likely also to perform 
well at time t + 1. One piece of evidence that supports this assumption is that the 
resampling frequency decreases with an increasing number of observations 



(Chopin, 2002). This implies that, although 7i"i and 7r 2 may be quite different, 7Tiooi 
and vr 10 02 are likely to be less so, provided that the data provides sufficient 
information on the parameters. As mentioned earlier in the text, the assumption 
of similar successive target densities is also required for the non-adaptive version 



(Chopin, 2002 Del Moral et al., 2006). 



ASMC can be easily extended by considering other proposal densities. For 
example it is possible to formulate a T-distributed version of the Liu/West 
proposal, this allows for heavier tailed proposals, the heaviness of which can be 
selected automatically by adaptively choosing the number of degrees of freedom; 
this t-based proposal includes the Liu/West as a special case. Other interesting 



algorithms can be formulated using DE proposals (Ter Braak, 2006) (which 



generalises the snooker algorithm of Gilks et al. (1994)) or regional MCMC 



proposals (Roberts and Rosenthal, 2009 Craiu et al., 2009) - both of which appeal 



strongly to the particle structure of the new method. 
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A Proof of Proposition |4.1 



Let A^ = A(9^\ 9'V') ie the observed A for the j'th particle and I denote the 
indicator function. The collection {h^\ 1/M}|£ 1 is an iid sample from n(h), but 
with weights defined as, 



E" /(AW)' 

the weighted particle set, {h^\ W^}^, has empirical density, 



M 



3=1 



Define a discrete random variable H*, which takes value hV' with probability W^\ 
For h* G R, 



p(^ < h*) = j>^) < h*) = M ^i J ' ;; - 

7=1 m2ji=iA a ) 



In the limit as M — > oo, (0, 9') ~ tt(9)K(9, 9') the strong law of large numbers 
implies, 



jg ££i /(A^)I(/^> < ft*) E^ )AW) [/(A)I(ff</Q] 

^E,=i/(A (i) ) r E> )AW) [/(A)] 

E H [E e>@ , lH if(A)\H = h]I(H<V 



E H [E e , e , lH [f(A)\H = h]] 
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using the properties of conditional expectation. To complete the proof, observe 
that E e ,e'|# [/(A)|tf = h] = w(H), so, 



a/->oo l - ; " E n(h) [w(H)] fw{h)ir{h)dh ' 



convergence in distribution follows as required. □ 



B Proof of Theorem 14.1 



The proof proceeds in two parts and starts by observing that 
7i-(")(/j,) = 7l(h) exp{n/ n } where, 



1 - 

n ^— » 



n 
i=l 



In the first part, the following results will be proved: 

• There exists a function, / : % — > M, such that sup/> eW |/ n — /| < kfn~ a . 

• h opt is the unique global maximum of /. 

• / is twice differentiable in an interval containing h opt . 

In the second part of the proof, these results will be used to show that as n — > oo, 

7T"( n ) (h) approaches a Dirac mass centred on h opt . 

Part 1 

Claim that f(h) = log (a + g{h)). It is easy to show 

su P/ie« l( a + 9 {t) )/( a + 9) ~M < kt~ a , where h = k g /mi heH {a + g(h)} < oo by 

assumption. Put k m = k L + 1 > ki + k[t~^/2 for all t > exp{— 2/kia}. For 



sufficiently large (finite) t and any h E "H, (a + g®)/(a + g) is close to 1, a Taylor 
series argument can therefore be applied to give, 

-k m r a < -kt~ a - kr 2a /2 < io g (i - ht- a ) < io g [(o + (? (t) )/(« + g)\ < ht- a . 

The preceding argument shows that, 



sup 

hen 



a + g® 

logi : 

a + g 



sup | log(a + g (t) ) - log(a + p)| < A; m t" 



For all /i e "H, 



f n 
|/ n -log(o + y)| < - V|log(a + # w )-log(a + #)| 

:=1 

n 



h U 



n 
t=i 

< k f n~ a , 

as required. If g is twice differentiable in an interval ICK, where X 3 /i opt then, 
being a continuous function of g, / is also twice differentiable on X. That h opt is 
the unique global maximum of / is now implied by the assumptions on g and the 
strict increasing monotonicity of the logarithm. 
Part 2 

In this part, the properties of / will be used to show that for any interval 
containing h opt as an interior point and as n — > oo, the probability that h belongs 
to that interval tends to 1. 
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Let X denote the compliment of X in %. Let Xo D "H be any interval containing 
h opt as an interior point. By virtue of the global uniqueness of h opt , there exists an 
open interval X\ D Xo also containing h op t such that f"(h) < for all h E Xi and 
with the property, inf fceXl / > sup^i /■ 
Then for all open intervals X 2 D X 1; define, 



sup{/(/i op t) - /(/i)} = Ci, 
hex 2 



inf{/(/iopt)-/(/i)} = e 2 . 



The strict concavity of / on I 1 implies e 2 > e x (note the strict inequality). 
Consider the probability of h G X after n updates, 



P(/i G X ) > P(/i G XO 



J Xi nW(h)dh J Xi n(h) exp{nf n (h)}dh 

J^)(h)dh = f^n(h)exp{nf n (h)}dh' 

1 
> - 



f x n(h) exp{n/„(/i)}d/i ' 
L n(h) cxp{nf„(h)}dh 



since for any positive reals a±, a 2 and a^, if a\ > a 2 then - a _l a > 



Q2 



a2+az l+a-i/a'2 ' 

By uniform convergence of f n , the quotient of integrals in the denominator can be 
bounded above by, 



f x n(h)exp{nf n (h)}dh f x ir(h) exp{nf(h) + kfii 1 a }dh 

j x ir(h)exp{nf n (h)}dh ~ j x ii(h)exp{nf(h) — kfn l ~ a }dh^ 

J Xi n(h) exp{nf(h opt ) - ne 2 + kfn l ~ a }dh 



< 



L ir(h) exp{nf(h op t) - ne 1 - kfn 1 a }dh : 



P.W^GXQ 

" ¥^h¥i7) eMniei - e2) + 2kfn h 



— >■ as n — >■ oo, 
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since e\ — €2 < 0. Therefore F(h G X ) — > 1 as n — > 00. Since the choice of X E3 h op t 
was arbitrary, it may be made infinitesimally narrow and still, after enough 
iterations of the sampler P(/i e Xq) — > 1. This implies that ~K^ n '{h) tends in 
distribution to a Dirac mass centred on h op t and establishes the claim. D 
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